ABSTRACT

Functional dependencies (FDs) specify the intended data semantics 
while violations of FDs indicate deviation from these semantics. In 
this paper, we study a data cleaning problem in which the FDs may 
not be completely correct, e.g., due to data evolution or incomplete 
knowledge of the data semantics. We argue that the notion of rela- 
tive trust is a crucial aspect of this problem: if the FDs are outdated, 
we should modify them to fit the data, but if we suspect that there 
are problems with the data, we should modify the data to fit the 
FDs. In practice, it is usually unclear how much to trust the data 
versus the FDs. To address this problem, we propose an algorithm 
for generating non-redundant solutions (i.e., simultaneous modifi- 
cations of the data and the FDs) corresponding to various levels of 
relative trust. This can help users determine the best way to modify 
their data and/or FDs to achieve consistency. 

1. INTRODUCTION 

Poor data quality is a serious and costly problem, often addressed 
by specifying the intended semantics using constraints such as 
Functional Dependencies (FDs), and modifying or discarding in- 
consistent data to satisfy the provided constraints. For example, 
many techniques exist for editing the data in a non-redundant way 
so that a supplied set of FDs is satisfied [3, 4, 10]. However, in practice,
it is often unclear whether the data are incorrect or whether the
intended semantics are incorrect (or both). It is difficult to get the 
semantics right, especially in complex data-intensive applications, 
and the semantics (and/or the schema) may change over time. Thus, 
practical data cleaning approaches must consider errors in the data 
as well as errors in the specified constraints, as illustrated by the 
following example. 

EXAMPLE 1. Figure 1 depicts a relation that holds employee
information. Data are collected from various sources (e.g., Payroll
records, HR) and thus might contain inconsistencies due to, for
instance, duplicate records and human errors. Suppose that we
initially assert the FD Surname, GivenName → Income.
That is, whenever two tuples agree on attributes Surname and
GivenName, they must agree on Income. This FD may hold for
Western names, in which surname and given name may uniquely
identify a person, but not for Chinese names (e.g., tuples t$ and tg
probably refer to different people). Thus, we could change the FD
to Surname, GivenName, BirthDate → Income and
resolve the remaining inconsistencies by modifying the data, i.e.,
setting the Income of tg (or tz) to be equal to that of tz (resp. ts).

Figure 1: An example of a database instance of persons

In [5], an algorithm was proposed to generate a single repair of
both the data and the FDs, which has the lowest cost according
to a unified cost model of modifying a tuple and modifying an
FD (i.e., adding an extra column to its left-hand-side). However,
in practice, the data and the FDs are not always equally "trustworthy".
For example, FDs that were automatically discovered
from legacy data may be less reliable than those manually specified
by a domain expert. Also, the reliability of data depends
on many factors such as the data sources and extraction methods.
Returning to Example 1, trusting the FD more than the data suggests
changing the Income of tuples ts, ta and tin to be equal to the
income of t3, tg and t%, respectively, while keeping the FD unchanged.
Trusting the data more than the FD implies modifying
the FD to be Surname, GivenName, BirthDate, Phone → Income,
while keeping the data unchanged.

In this paper, we argue that the notion of relative trust between 
the data and the FDs must be taken into account when deciding how 
to resolve inconsistencies. We propose an algorithm that generates 
multiple suggestions for how to modify the data and/or the FDs in 
a minimal and non-redundant way, corresponding to various levels 
of relative trust. These suggested repairs can help users and domain 
experts determine the best way to resolve inconsistencies. 

Returning to Example 1, it is not clear whether we should modify
the data alone, or add BirthDate (or also Phone) to the left-hand-side
of the FD and resolve any remaining violations by modifying
the data. Without complete knowledge of the data semantics,
these possibilities may not be obvious by manually inspecting the 
data and the FDs. Computing various alternative fixes correspond- 
ing to different relative trust levels can give users a better idea of 
what could be wrong with their data and their constraints, and how 
to resolve problems. 

Implementing the proposed relative-trust-aware approach re- 
quires solving the following technical challenges: 

• Minimality of changes. As in previous work on data repair
[3, 4, 6, 10], the suggested modifications to the data and
the FDs should be (approximately) minimal and should avoid
redundant modifications in order to be meaningful. However,
it is not obvious how to define non-redundancy and minimal- 
ity when both the data and the FDs can be modified, espe- 
cially if we want to produce multiple suggestions, not just a 
single one that is globally minimal according to a unified cost 
model. Furthermore, finding a data repair with the fewest 
possible changes is already NP-hard even if the FDs cannot 
be modified. 

• Specifying Relative Trust. In previous work on simultane- 
ously repairing the data and FDs [5], the level of relative trust 
was fixed and implicitly encoded in the unified cost model. 
Since we want to produce multiple suggested repairs corre- 
sponding to various levels of relative trust, we need to de- 
fine a semantically meaningful metric for measuring relative 
trust. 

In this paper, we address a data cleaning problem in which we 
are given a set of FDs and a data set that does not comply with 
the specified FDs, and we return multiple non-redundant sugges- 
tions (corresponding to different levels of relative trust) for how to 
modify the data and/or the FDs in order to achieve consistency. We 
make the following technical contributions: 

• We propose a simple definition of relative trust, in which a
parameter τ specifies the maximum number of allowed data
changes; the smaller the value of τ, the greater the trust in the data.
Using the notion of relative trust, we define a space of minimal
FD and data repairs based on dominance with respect to
τ and the amount of changes to the FDs.

• We give an efficient and effective algorithm for finding a
minimal modification of a set of FDs such that no more than
τ data modifications will be required to satisfy the modified
FDs. The algorithm prunes the space of possible FD modifications
using A* search combined with a novel heuristic that
estimates the distance to an optimal set of FD modifications.
Intuitively, this algorithm computes a minimal repair of the
FDs for the relative trust level specified by τ. We then give
an algorithm that lists the required data modifications. Since
computing the fewest data changes required to satisfy a set of 
FDs is NP-hard, we resort to approximation. The suggested 
repairs generated by our algorithm are provably close to our 
minimality criteria on the data changes. The approximation 
factor only depends on the number of FDs and the number of 
attributes. 

• Using the above algorithm as a subroutine, we give an al- 
gorithm for generating a sample of suggested repairs corre- 
sponding to a range of relative trust values. We optimize this 
technique by reusing repairs for higher values of τ to obtain
repairs for smaller τ.

Finally, we perform various experiments that justify the need to 
incorporate relative trust in the data cleaning process and we show 
order-of-magnitude performance improvements of the proposed al- 
gorithms over straightforward approaches. 

The remainder of the paper is organized as follows. Section 2
gives the notation and definitions used in the paper. In Section 3, we
introduce the concepts of minimal repairs and relative trust. Section 4
introduces our approach to finding a nearly-minimal repair
for a given relative trust value, followed by a detailed discussion
of modifying the FDs in Section 5 and modifying the data in Section 6.
Section 7 presents the algorithm for efficiently generating
multiple suggested repairs. Section 8 presents our experimental results,
Section 9 discusses related work, and Section 10 concludes
the paper.

2. PRELIMINARIES 

Let R be a relation schema consisting of m attributes, denoted
{A_1, ..., A_m}. We denote by |R| the number of attributes in R.
Dom(A) denotes the domain of an attribute A ∈ R. We assume
that attribute domains are unbounded. An instance I of R is a set
of tuples, each of which belongs to the domain Dom(A_1) × ... ×
Dom(A_m). We refer to an attribute A ∈ R of a tuple t ∈ I as a
cell, denoted t[A].

For two attribute sets X, Y ⊆ R, a functional dependency (FD)
X → Y holds on an instance I, denoted I ⊨ X → Y, iff for every
two tuples t_1, t_2 in I, t_1[X] = t_2[X] implies t_1[Y] = t_2[Y]. Let
E be the set of FDs defined over R. We denote by |E| the number
of FDs in E. We say that I satisfies E, written I ⊨ E, iff the tuples
in I do not violate any FD in E. We assume that E is minimal [1],
and each FD is of the form X → A, where X ⊆ R and A ∈ R.
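
The satisfaction test above can be checked directly by comparing pairs of tuples. The following Python sketch is only an illustration: the names `violations` and `satisfies` and the three-row `people` relation are hypothetical and are not taken from Figure 1.

```python
from itertools import combinations

def violations(instance, lhs, rhs):
    """Return index pairs (i, j), i < j, of tuples that agree on every
    attribute in lhs but differ on rhs, i.e. that violate lhs -> rhs."""
    return [
        (i, j)
        for (i, s), (j, t) in combinations(enumerate(instance), 2)
        if all(s[a] == t[a] for a in lhs) and s[rhs] != t[rhs]
    ]

def satisfies(instance, fds):
    """True iff the instance violates none of the FDs (each FD is (lhs, rhs))."""
    return all(not violations(instance, lhs, rhs) for lhs, rhs in fds)

# Hypothetical data (an assumption for illustration only).
people = [
    {"Surname": "Li", "GivenName": "Wei", "Income": 50},
    {"Surname": "Li", "GivenName": "Wei", "Income": 70},
    {"Surname": "Hill", "GivenName": "Ann", "Income": 60},
]
fd = (("Surname", "GivenName"), "Income")
```

Here the first two rows agree on the left-hand side but not on Income, so `satisfies(people, [fd])` is false, while dropping either of them makes the FD hold.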

We use the notion of V-instances, which was first introduced in 
[10], to concisely represent multiple data instances. In V-instances,
cells can be set to variables that may be instantiated in a specific 
way. 

DEFINITION 1. V-instance. Given a set of variables
{v_1^A, v_2^A, ...} for each attribute A ∈ R, a V-instance I of R is
an instance of R where each cell t[A] in I can be assigned to either
a constant in Dom(A), or a variable v_i^A.

A V-instance I represents multiple (ground) instances of R that
can be obtained by assigning each variable v_i^A to any value from
Dom(A) that is not among the A-values already occurring in I,
and such that no two distinct variables v_i^A and v_j^A can have equal
values. For brevity, we refer to a V-instance as an instance in the
remainder of the paper.

We say that a vector X = (x_1, ..., x_k) dominates another vector
Y = (y_1, ..., y_k), written X ≺ Y, iff for all i ∈ {1, ..., k},
x_i ≤ y_i, and at least one element x_j in X is strictly less than the
corresponding element y_j in Y.
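
As a small illustration of the dominance relation, the sketch below tests X ≺ Y for two equal-length numeric vectors. The function name `dominates` is an assumption introduced here, not notation from the paper.

```python
def dominates(x, y):
    """True iff x dominates y: x is no larger in every position and
    strictly smaller in at least one position."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(x, y))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)
```

Note that a vector never dominates itself, and two vectors such as (0, 5) and (1, 4) are incomparable: neither dominates the other.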

3. SPACES OF POSSIBLE REPAIRS 

In this section, we define a space of minimal repairs of both data
and FDs (Section 3.1), and we present our notion of relative trust
(Section 3.2).

3.1 Minimal Repairs of Data and FDs 

We consider data repairs that change cells in I rather than deleting
tuples from I. We denote by S(I) all possible repairs of I. All
instances in S(I) have the same number of tuples as I. Because we
aim at modifying a given set of FDs, rather than discovering a new
set of FDs from scratch, we restrict the allowed FD modifications
to those that relax (i.e., weaken) the supplied FDs. We do not consider
adding new constraints. That is, E' is a possible modification
of E iff I ⊨ E implies I ⊨ E', for any data instance I. Given a set
of FDs E, we denote by S(E) the set of all possible modifications
of E resulting from relaxing the FDs in E in all possible ways. We
define the universe of possible repairs as follows.

DEFINITION 2. Universe of Data and FDs Repairs. Given a
data instance I and a set of FDs E, the universe of repairs of data
and FDs, denoted U, is the set of all possible pairs (E', I') such
that E' ∈ S(E), I' ∈ S(I), and I' ⊨ E'.

We focus on the subset of U whose elements are Pareto-optimal with respect
to two distance functions: dist_c(E, E'), which measures the distance
between two sets of FDs, and dist_d(I, I'), which measures the distance
between two database instances. We refer to such repairs as
minimal repairs, defined as follows.

DEFINITION 3. Minimal Repair. Given an instance I and a set
of FDs E, a repair (E', I') ∈ U is minimal iff there is no (E'', I'') ∈ U such
that (dist_c(E, E''), dist_d(I, I'')) ≺ (dist_c(E, E'), dist_d(I, I')).

We deliberately avoid aggregating changes to data and changes 
to FDs into one metric in order to enable using various metrics for 
measuring both types of changes, which might be incomparable. 
For example, one metric for measuring changes in E is the number
of modified FDs in E, while changes in I could be measured by the
number of changed cells. Also, this approach provides a wide spectrum
of Pareto-optimal repairs that ranges from completely trusting
I (and only changing E) to completely trusting E (and only changing
I).

For a repair I' of I, we denote by Δ_d(I, I') the cells that have
different values in I and I'. We use the cardinality of Δ_d(I, I')
to measure the distance between two instances, which has been
widely used in previous data cleaning techniques (e.g., [4, 6, 10]).
That is, dist_d(I, I') = |Δ_d(I, I')|.
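
The cell-based distance can be computed by a position-wise comparison of the two instances. The sketch below assumes, purely for illustration, that an instance is a list of dictionaries with identical keys and that tuple order is preserved by the repair; the names `changed_cells` and `dist_d` are introduced here.

```python
def changed_cells(original, repaired):
    """Return the set of cells (tuple index, attribute) whose value differs."""
    if len(original) != len(repaired):
        raise ValueError("a repair must keep the number of tuples")
    return {
        (i, attr)
        for i, (old, new) in enumerate(zip(original, repaired))
        for attr in old
        if old[attr] != new[attr]
    }

def dist_d(original, repaired):
    """Distance between instances: the number of changed cells."""
    return len(changed_cells(original, repaired))

# Hypothetical two-tuple instance and a repair that edits one cell.
before = [{"A": 1, "B": 1}, {"A": 1, "B": 2}]
after = [{"A": 1, "B": 1}, {"A": 1, "B": 1}]
```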

Recall that we restrict the modifications to E to those that relax
the constraints in E. Thus, an FD F' is a possible modification
of an FD F iff I ⊨ F implies I ⊨ F', for any instance I. We use
a simple relaxation mechanism: we only allow appending zero or
more attributes to the left-hand-side (LHS) of an FD. Formally, an
FD X → A ∈ E can be modified by appending a set of attributes
Y ⊆ (R \ XA) to the LHS, resulting in an FD XY → A. We
disallow adding A to the LHS to prevent producing trivial FDs.
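
To make the relaxation mechanism concrete, the sketch below enumerates every relaxation XY → A of a single FD X → A over a given schema. The generator name `relaxations` and the representation of an FD as a pair (lhs tuple, rhs attribute) are assumptions made for this example.

```python
from itertools import combinations

def relaxations(schema, lhs, rhs):
    """Yield every FD obtained by appending a subset Y of R \\ XA to the LHS.
    The unchanged FD (Y empty) is yielded first."""
    free = [a for a in schema if a not in lhs and a != rhs]
    for size in range(len(free) + 1):
        for extra in combinations(free, size):
            yield tuple(lhs) + extra, rhs

# For R = {A, B, C, D} and the FD A -> B, only C and D may be appended.
options = list(relaxations("ABCD", ("A",), "B"))
```

A single FD with k free attributes has 2^k relaxations, which is why the search in Section 5 prunes this space rather than enumerating it.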

Note that different FDs in E might be modified to the same FD.
For example, both A → B and C → B can be modified to AC →
B. Therefore, the number of FDs in any E' ∈ S(E) is less than or
equal to the number of FDs in E. We maintain a mapping between
each FD in E, and its corresponding repair in E'. Without loss
of generality, we assume hereafter that |E'| = |E| by allowing
duplicate FDs in E'.

We define the distance between two sets of FDs as follows.
For E = {X_1 → A_1, ..., X_z → A_z} and E' = {Y_1X_1 →
A_1, ..., Y_zX_z → A_z}, the term Δ_c(E, E') denotes the vector
(Y_1, ..., Y_z), which consists of the LHS extensions applied to the FDs of E to obtain E'.
To measure the distance between E and E', we use the function

$$\mathrm{dist}_c(E, E') = \sum_{Y \in \Delta_c(E, E')} w(Y),$$

where w(Y) is a weighting function that determines
the relative penalty of adding a set of attributes Y. The
weighting function w(.) is intuitively non-negative and monotone
(i.e., for any two attribute sets X and Y, X ⊆ Y implies that
w(X) ≤ w(Y)). A simple example of w(Y) is the number of attributes
in Y. However, this does not distinguish between attributes
that have different characteristics. Other features of appended attributes
can be used for obtaining other definitions of w(.). For
example, consider two attributes A and B that could be appended
to the LHS of an FD, where A is a key (i.e., A → R), while B
is not. Intuitively, appending A should be more expensive than appending
B because the new FD in the former case is trivially satisfied.
In general, the more informative a set of attributes is, the
more expensive it is when being appended to the LHS of an FD.
The information captured by a set of attributes can be measured using
various metrics, such as the number of distinct values of Y in
I, and the entropy of Y. Another definition of w(Y) could rely on
the increase in the description length for modeling I using FDs due
to appending Y (refer to [5, 11]).

In general, w(Y) depends on a given data instance to evaluate
the weight of Y. Therefore, changing the cells in I during the
repair-generating algorithm might affect the weights of attributes.
We make a simplifying assumption that w(Y) depends only on the
initial instance I. This is based on an observation that the number
of violations in I with respect to E is typically much smaller than
the size of I, and thus repairing data does not significantly change
the characteristics of attributes such as entropy and the number of
distinct values.
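
Two of the weighting functions mentioned above can be sketched as follows. The function names are hypothetical, and the choice to give the empty extension a weight of zero under the distinct-values metric is an assumption made here so that an unchanged FD costs nothing; both functions are non-negative and monotone.

```python
def w_count(attrs, instance):
    """Weight of an extension = number of appended attributes."""
    return len(attrs)

def w_distinct(attrs, instance):
    """Weight = number of distinct value combinations of attrs in the
    initial instance (0 for the empty extension, by assumption)."""
    if not attrs:
        return 0
    ordered = sorted(attrs)
    return len({tuple(row[a] for a in ordered) for row in instance})

def dist_c(extensions, instance, w):
    """Sum of w(Y) over the vector of LHS extensions (Y_1, ..., Y_z)."""
    return sum(w(y, instance) for y in extensions)

# Hypothetical instance in which A is key-like and B is not.
rows = [{"A": 1, "B": 1}, {"A": 2, "B": 1}, {"A": 3, "B": 2}]
```

Under `w_distinct`, appending the key-like attribute A costs 3 while appending B costs 2, which matches the intuition that more informative attributes should be more expensive.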

3.2 Relative Trust in Data vs. FDs 

We defined a space of minimal repairs that covers a wide spectrum,
ranging from repairs that only alter the data, while keeping
the FDs unchanged, to repairs that only alter the FDs, while keeping
the data unchanged. The idea behind relative trust is to limit
the maximum number of cell changes that can be performed while
obtaining I' to a threshold τ, and to obtain a set of FDs E' that is
the closest to E and is satisfied by I'. The obtained repair (E', I')
is called a τ-constrained repair, formally defined as follows.

DEFINITION 4. τ-constrained Repair. Given an instance I,
a set of FDs E, and a threshold τ, a τ-constrained repair
(E', I') is a repair in U such that dist_d(I, I') ≤ τ, and no
other repair (E'', I'') ∈ U has (dist_c(E, E''), dist_d(I, I'')) ≺
(dist_c(E, E'), τ).

In other words, a τ-constrained repair is a repair in U whose
distance to I is less than or equal to τ, and which has the minimum
distance to E across all repairs in U with distance to I also less
than or equal to τ. We break ties using the distance to I (i.e., if
two repairs have an equal distance to E and have distances to I less
than or equal to τ, we choose the one closer to I).
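
Over a finite list of candidate repairs, this selection rule reduces to a filter followed by a lexicographic minimum. The sketch below assumes each candidate is summarized by a hypothetical pair (dist_c, dist_d); the candidate values are invented for illustration.

```python
def tau_constrained(candidates, tau):
    """Pick the candidate with at most tau cell changes that is closest to
    the original FDs, breaking ties by the number of cell changes.
    Each candidate is a pair (dist_c, dist_d); returns None if none fits."""
    feasible = [c for c in candidates if c[1] <= tau]
    # Tuples compare lexicographically: first dist_c, then dist_d.
    return min(feasible, default=None)

# Hypothetical candidates: (FD distance, number of changed cells).
candidates = [(0, 4), (1, 2), (2, 0), (2, 1)]
```

With these values, τ = 0 or τ = 1 selects (2, 0), τ = 2 or τ = 3 selects (1, 2), and τ = 4 selects (0, 4), which keeps the FDs unchanged.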

Possible values of τ range from 0 to the minimum number of
cell changes that must be applied to I in order to satisfy E, denoted
δ_opt(E, I). We can also specify the threshold on the number of
allowed cell changes as a percentage of δ_opt(E, I), denoted τ_r (i.e.,
τ_r = τ / δ_opt(E, I)).

The mapping between minimal repairs and τ-constrained repairs
is as follows. (1) Each τ-constrained repair is a minimal repair;
(2) All minimal repairs can be found by varying the relative trust
τ in the range [0, δ_opt(E, I)], and obtaining the corresponding
τ-constrained repair. Specifically, each minimal repair (E', I') is
equal to a τ-constrained repair, where τ is in the range defined
as follows. Let (E'', I'') be the minimal repair with the smallest
dist_d(I, I'') that is strictly greater than dist_d(I, I'). If such a repair
does not exist, let (E'', I'') be (∅, ∅). The range of τ is defined
as follows.

$$\tau \in \begin{cases} [\mathrm{dist}_d(I, I'),\ \mathrm{dist}_d(I, I'')) & \text{if } (E'', I'') \neq (\emptyset, \emptyset) \\ [\mathrm{dist}_d(I, I'),\ \infty) & \text{if } (E'', I'') = (\emptyset, \emptyset) \end{cases} \qquad (1)$$

If (E'', I'') = (∅, ∅), the range [dist_d(I, I'), ∞) corresponds to
a unique minimal repair where dist_d(I, I') is equal to δ_opt(E, I).
We prove these two points in the following theorem (the proof is in
Appendix A.1).

THEOREM 1. Each τ-constrained repair is a minimal repair.
Each minimal repair (E', I') corresponds to a τ-constrained repair,
where τ belongs to the range defined in Equation 1.

4. COMPUTING A SINGLE REPAIR FOR 
A GIVEN RELATIVE TRUST LEVEL 
There is a strong interplay between modifying the data and the
FDs. Obtaining a data instance that is closest to I, while satisfying
a set of FDs E', highly depends on E'. Also, obtaining a set of FDs
E' that is closest to E, such that E' holds in a given data instance
I', highly depends on the instance I'. This interplay represents the
main challenge for simultaneously modifying the data and the FDs.

For example, consider a simple approach that alternates between 
editing the data and modifying the FDs until we reach consistency. 
This may not give a minimal repair (e.g., we might make a data 
change in one step that turns out to be redundant after we change 
one of the FDs in a subsequent step). Furthermore, we may have to 
make more than τ cell changes because it is difficult to predict the
amount of necessary data changes while modifying the FDs. 

Our solution to generating a minimal repair for a given level of
relative trust consists of two steps. In the first step, we modify
the FDs to obtain a set E' that is as close as possible to E, while
guaranteeing that there exists a data repair I' satisfying E' with
a distance to I less than or equal to τ. In the second step, we
materialize the data instance I' by modifying I with respect to E'
in a minimal way. We describe this approach in Algorithm 1.

Finding E' in the first step requires computing the minimum
number of cell changes in I to satisfy E' (i.e., δ_opt(E', I)). Note
that computing δ_opt(E', I) does not require materialization of an
instance I' that satisfies E' and has the minimum number of
changes. Instead, we collect enough statistics about the violations
in the data to compute δ_opt(E', I). We will discuss this step in more
detail in Section 5. Obtaining a modified instance I' in line 3 will
be discussed in Section 6.

Algorithm 1 Repair_Data_FDs(E, I, τ)

1: obtain E' from S(E) such that δ_opt(E', I) ≤ τ, and no other E'' ∈
   S(E) with δ_opt(E'', I) ≤ τ has dist_c(E, E'') < dist_c(E, E') (ties
   are broken using δ_opt(E', I))
2: if E' ≠ ∅ then
3:    obtain I' that satisfies E' while performing at most δ_opt(E', I) cell
      changes, and return (E', I')
4: else
5:    return (∅, ∅)
6: end if



The following theorem establishes the link between the repairs
generated by Algorithm 1 and Definition 4. The proof is in
Appendix A.2.

THEOREM 2. Repairs generated by Algorithm 1 are τ-constrained
repairs.

A key step in Algorithm 1 is computing δ_opt(E', I) (i.e., the
minimum number of cells in I that have to be changed in order
to satisfy E'). Unfortunately, computing the exact minimum number
of cell changes when E' contains at least two FDs is NP-hard
[10]. We therefore propose an approximate solution based on
upper-bounding the minimum number of necessary cell changes. Assume
that there exists a P-approximate upper bound on δ_opt(E', I), denoted
δ_P(E', I) (details are in Section 6). That is, δ_opt(E', I) ≤
δ_P(E', I) ≤ P · δ_opt(E', I), for some constant P. By using
δ_P(E', I) in place of δ_opt(E', I) in Algorithm 1, we can satisfy
the criteria in Definition 4 in a P-approximate way. Specifically,
the repair generated by Algorithm 1 becomes a P-approximate
τ-constrained repair, which is defined as follows (the proof is similar
to that of Theorem 2).

DEFINITION 5. P-approximate τ-constrained Repair. Given
an instance I, a set of FDs E, and a threshold τ, a P-approximate
τ-constrained repair (E', I') is a repair in U such
that dist_d(I, I') ≤ τ, and no other repair (E'', I'') ∈ U has
(dist_c(E, E''), P · dist_d(I, I'')) ≺ (dist_c(E, E'), τ).

In the remainder of this paper, we present an implementation
of line 1 (Section 5) and line 3 (Section 6) of Algorithm 1. Our
implementation is P-approximate, as defined above, with P = 2 ·
min{|R| − 1, |E|}, where |R| denotes the number of attributes in
R, and |E| denotes the number of FDs in E.

5. MINIMALLY MODIFYING THE FDS 

In this section, we show how to obtain a modified set of FDs
E' that is part of a P-approximate τ-constrained repair (line 1 of
Algorithm 1). That is, we need to obtain E' ∈ S(E) such that
δ_P(E', I) ≤ τ, and no other FD set E'' ∈ S(E) with δ_P(E'', I) ≤
τ has dist_c(E, E'') < dist_c(E, E').

First, we need to introduce the notion of a conflict graph of I
with respect to E, which was previously used in [2]:

DEFINITION 6. Conflict Graph. A conflict graph of an in- 
stance I and a set of FDs E is an undirected graph whose set of 
vertices is the set of tuples in I, and whose set of edges consists of 
all edges (ti, tj) such that ti and tj violate at least one FD in E. 
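
A conflict graph can be built by testing every pair of tuples against every FD. The sketch below is a quadratic-time illustration only; the function name `conflict_graph`, the edge-label representation, and the four-tuple instance are assumptions and do not reproduce Figure 2.

```python
from itertools import combinations

def conflict_graph(instance, fds):
    """Map each conflicting pair (i, j), i < j, to the list of FDs it violates.
    Vertices are tuple indexes; each FD is a pair (lhs tuple, rhs attribute)."""
    edges = {}
    for (i, s), (j, t) in combinations(enumerate(instance), 2):
        broken = [
            (lhs, rhs)
            for lhs, rhs in fds
            if all(s[a] == t[a] for a in lhs) and s[rhs] != t[rhs]
        ]
        if broken:
            edges[(i, j)] = broken
    return edges

# Hypothetical instance over R = {A, B, C, D} with FDs A -> B and C -> D.
rows = [
    {"A": 1, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 2, "C": 1, "D": 2},
    {"A": 2, "B": 3, "C": 2, "D": 5},
    {"A": 2, "B": 4, "C": 3, "D": 5},
]
fds = [(("A",), "B"), (("C",), "D")]
graph = conflict_graph(rows, fds)
```

In this example the first two tuples violate both FDs, and the last two violate only A → B, so the graph has exactly two edges.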

Figure 2 shows an instance I, a set of FDs E, and the corresponding
conflict graph. The label of each edge represents the FDs
that are violated by the two tuples that the edge connects.

In Section 6, we present an algorithm to obtain an instance repair
I' that satisfies a set of FDs E' ∈ S(E). The number of cell
changes performed by our algorithm is linked to the conflict graph
of E' and I as follows. Let C_2opt(E', I) be a 2-approximate minimum
vertex cover of the conflict graph of E' and I, which we can
obtain in PTIME using a greedy algorithm [7]. The number of cell
changes performed by our algorithm is at most α · |C_2opt(E', I)|,
where α = min{|R| − 1, |E|}. Moreover, we prove that the
number of changed cells is 2α-approximately minimal. Therefore,
we define δ_P(E', I) as α · |C_2opt(E', I)|, which represents
a 2α-approximate upper bound on δ_opt(E', I) that can be computed
in PTIME. Based on the definition of δ_P(E', I), our goal
in this section can be rewritten as follows: obtain E' ∈ S(E) such
that |C_2opt(E', I)| ≤ τ/α, and no other FD set E'' ∈ S(E) with
|C_2opt(E'', I)| ≤ τ/α has dist_c(E, E'') < dist_c(E, E').
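
The bound δ_P can be sketched with the textbook maximal-matching heuristic, which is one standard 2-approximation for minimum vertex cover; whether it coincides with the greedy algorithm of [7] is an assumption of this example. The function names are hypothetical.

```python
def vertex_cover_2approx(edges):
    """Take both endpoints of every edge not yet covered. The chosen edges
    form a maximal matching, so the cover is at most twice the optimum."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

def delta_p(edges, num_attributes, num_fds):
    """Upper bound on cell changes: alpha * |C_2opt|, alpha = min(|R| - 1, |E|)."""
    alpha = min(num_attributes - 1, num_fds)
    return alpha * len(vertex_cover_2approx(edges))

# Hypothetical conflict graph: a path over four tuples.
path_edges = [(0, 1), (1, 2), (2, 3)]
```

For the path above the heuristic returns all four vertices, whereas the optimal cover {1, 2} has size two, which illustrates the factor of 2. A state would be a goal state when `delta_p(...) <= tau`.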

Figure 3 depicts several possible modifications of E from Figure 2,
along with dist_c(E, E') (assuming that the weighting function
w(Y) is equal to |Y|), the corresponding conflict graph,
C_2opt(E', I), and δ_P(E', I). For τ = 2, the modifications of E
that are part of P-approximate τ-constrained repairs are {CA →
B, C → D} and {DA → B, C → D}.

5.1 Searching the Space of FD Modifications 

We model the possible FD modifications S(E) as a state space,
where for each E' ∈ S(E), there exists a state representing
Δ_c(E, E') (i.e., the vector of attribute sets appended to LHSs of
FDs to obtain E'). Additionally, we call Δ_c(E, E') a goal state
iff δ_P(E', I) ≤ τ, for a given threshold value τ (or equivalently,
|C_2opt(E', I)| ≤ τ/α). The cost of a state Δ_c(E, E') is equal
to dist_c(E, E'). We assume that the weighting function w(.) is
monotone and non-negative. Our goal is to locate the cheapest goal
state for a given value of τ, which amounts to finding an FD set E'
that is part of a P-approximate τ-constrained repair.

Figure 3: An example of multiple FD repairs

Figure 4: The state search space for R = {A, B, C, D, E, F}
and E = {A → F}: (a) a graph search space, (b) a tree search
space

The monotonicity of the weighting function w (and hence the
monotonicity of the overall cost function) allows for pruning a large
part of the state space. We say that a state (Y_1, ..., Y_z) extends another
state (Y'_1, ..., Y'_z), where z = |E|, iff for all i ∈ {1, ..., z},
Y'_i ⊆ Y_i. Clearly, if (Y'_1, ..., Y'_z) is a goal state, we can prune all
the FD sets that extend it because w(.) is monotone.

In Figure 4(a), we show all the states for R =
{A, B, C, D, E, F} and E = {A → F}. Each arrow in
Figure 4(a) indicates that the destination state extends the source
state by adding exactly one attribute. We can find the cheapest goal
state by traversing the graph in Figure 4(a). For example, we can
use a level-wise breadth-first search strategy [13], which iterates
over states with the same number of attributes, and, for each such
set of states, we determine whether any state is a goal state. If one
or more goal states are found at the current level, we return the
cheapest goal state and terminate the search.

Figure 5: A space for R = {A, B, C, D} and E = {A → B, C → D}

We can optimize the search by adopting best-first traversal of 
the states graph [13]. That is, we maintain a list of states to be
visited next, called the open list, which initially contains the state
(∅, ..., ∅), and a list of states that have been visited, called the
closed list. In each iteration, we pick the cheapest state S from the 
open list, and test whether S is a goal state. If S is a goal state, 
we return it and terminate the search. Otherwise, we add S to the 
closed list, and we insert into the open list all the states that extend 
S by exactly one attribute and are not in the closed list. 
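
The best-first traversal with open and closed lists can be sketched as below. This is a minimal illustration under stated assumptions: the weighting function is w(Y) = |Y|, a state is a tuple of frozensets (one LHS extension per FD), and `is_goal` is a hypothetical caller-supplied predicate standing in for the test δ_P(E', I) ≤ τ.

```python
import heapq
from itertools import count

def best_first(schema, fds, is_goal):
    """Return the cheapest goal state, or None if no goal state exists."""
    root = tuple(frozenset() for _ in fds)
    tie = count()                        # tie-breaker so states are never compared
    open_list = [(0, next(tie), root)]   # priority queue ordered by cost
    closed = set()
    while open_list:
        cost, _, state = heapq.heappop(open_list)
        if state in closed:
            continue                     # already expanded via another path
        if is_goal(state):
            return state
        closed.add(state)
        # Children extend the state by exactly one attribute in one FD.
        for i, (lhs, rhs) in enumerate(fds):
            for a in schema:
                if a in lhs or a == rhs or a in state[i]:
                    continue
                child = state[:i] + (state[i] | {a},) + state[i + 1:]
                if child not in closed:
                    heapq.heappush(open_list, (cost + 1, next(tie), child))
    return None

example_fds = [(("A",), "B"), (("C",), "D")]
# Hypothetical goal predicate: D was appended to the LHS of the first FD.
found = best_first("ABCD", example_fds, lambda s: "D" in s[0])
```

Because every step adds one attribute of unit weight, states are popped in non-decreasing cost order, so the first goal state popped is a cheapest one.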

We can avoid using a closed list that keeps track of visited states,
and hence reduce the running time, by ensuring that each state can
only be reached from the initial state (∅, ..., ∅) using a unique
path. In other words, we need to reduce the graph in Figure 4(a)
to a tree (e.g., Figure 4(b)). To achieve this, we assign each state,
except (∅, ..., ∅), to a single parent. Assume that attributes in R
are totally ordered (e.g., lexicographically). For E with a single
FD, the parent of a state Y is another state Y \ {A} where A is
the greatest attribute in Y. Figure 4(b) shows the search tree that
is equivalent to the search graph in Figure 4(a). In general, when
E contains multiple FDs, the parent of a state (Y_1, ..., Y_z) is determined
as follows. Let A be the greatest attribute in Y_1 ∪ ... ∪ Y_z,
and j be the index of the last element in the vector (Y_1, ..., Y_z)
that contains A. The parent of the state (Y_1, ..., Y_z) is another
state (Y_1, ..., Y_{j−1}, Y_j \ {A}, Y_{j+1}, ..., Y_z). Figure 5 depicts an
example search space for the two FDs shown in Figure 2.
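
The parent rule can be written down in a few lines. In this sketch a state is a tuple of frozensets of single-letter attribute names ordered lexicographically; the names `parent` and `path_to_root` are introduced here for illustration.

```python
def parent(state):
    """Return the unique parent of a state, or None for the root.
    Removes the greatest attribute A from the last component containing it."""
    attrs = set().union(*state)
    if not attrs:
        return None                       # the root (all components empty)
    a = max(attrs)                        # greatest attribute in the union
    j = max(i for i, y in enumerate(state) if a in y)
    return state[:j] + (state[j] - {a},) + state[j + 1:]

def path_to_root(state):
    """Follow parent links; uniqueness of the parent makes this path unique."""
    path = [state]
    while (p := parent(path[-1])) is not None:
        path.append(p)
    return path

# Extensions (CD, B) for two FDs, e.g. A -> B and C -> D would not allow this
# exact vector; the state is purely a hypothetical input for the rule.
s = (frozenset("CD"), frozenset("B"))
```

Since each state has exactly one parent, generating only those children whose parent is the current state turns the search graph into a tree and removes the need for a closed list.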

5.2 A*-based Search Algorithm 

One problem with best-first tree traversal is that it might visit
cheap states that only lead to expensive goal states or no goal states
at all. A* search [13] avoids this by estimating the cost of the
cheapest goal state reachable (i.e., descending) from each state S in
the open list, denoted gc(S), and visiting the state with the smallest
gc(S) first. In order to maintain soundness of the algorithm
(i.e., returning the cheapest goal state), we must not overestimate
the cost of the cheapest goal state reachable from a state S [13].

Algorithm 2 describes the search procedure. The goal of lines 1
and 12-16, along with the sub-procedure getDescGoalStates,
is computing gc(S). The remainder of Algorithm 2 follows the
A* search algorithm: it initializes an open list, which is implemented
as a priority queue called PQ, by inserting the root state
(∅, ..., ∅). In each iteration, the algorithm removes the state with
the smallest value of gc(S) from PQ and checks whether it is a
goal state. If so, the algorithm returns the corresponding FD set.
Otherwise, the algorithm inserts the children of the removed state
into PQ, after computing gc(.) for each inserted state.

The two technical challenges of computing gc(S) are the tight- 
ness of the bound gc(S) (i.e., being close to the actual cost of the 
cheapest goal state descending from S), and having a small compu- 
tational cost. In the following, we describe how we address these 
challenges. 

Given a conflict graph G of I and E, each edge represents two
tuples in I that violate E. For any edge (t_i, t_j) in G, we refer to the
attributes that have different values in t_i and t_j as the difference set
of (t_i, t_j). Difference sets have been introduced in the context of
FD discovery (e.g., [12, 14]). For example, the difference sets for
(t_1, t_2), (ts, ts), and (t_3, t_4) in Figure 2 are BD, AD, and BCD,
respectively. We denote by 𝒟 the set of all difference sets for edges
in G (line 1 in Algorithm 2). The key idea that allows efficient
computation of gc(S) is that all edges (i.e., violations) in G with
the same difference set can be completely resolved by adding one
attribute from the difference set to the LHS of each violated FD
in E. For example, edges corresponding to difference set BD in
Figure 2 violate both A → B and C → D, and to fix these violations,
we need to add D to the LHS of the first FD, and B to
the LHS of the second FD. Similarly, fixing violations corresponding
to difference set BCD can be done by adding C or D to the
first FD (the second FD is not violated). Therefore, we partition the
edges of the conflict graph G based on their difference sets. In order
to compute gc(S), each group of edges corresponding to one
difference set is considered atomically, rather than individually.

Algorithm 2 Modify_FDs(E, I, τ)

1: construct the conflict graph G of E and I, and obtain the set of all
   difference sets in G, denoted 𝒟
2: PQ ← {(∅, ..., ∅)}
3: while PQ is not empty do
4:    pick the state S_h with the smallest value of gc(.) from PQ
5:    let E_h be the FD set corresponding to S_h
6:    compute C_2opt(E_h, I)
7:    if |C_2opt(E_h, I)| · min{|R| − 1, |E|} ≤ τ then
8:       return E_h
9:    end if
10:   remove S_h from PQ
11:   for each state S_i that is a child of S_h do
12:      let E_i be the FD set corresponding to S_i
13:      let 𝒟_s be the subset of difference sets in 𝒟 that violate E_i
14:      let G_0 be an empty graph
15:      minStates ← getDescGoalStates(S_i, S_i, G_0, 𝒟_s, τ)
16:      set gc(S_i) to the minimum cost across all states in minStates,
         or ∞ if minStates is empty
17:      if gc(S_i) is not ∞ then
18:         insert S_i into PQ
19:      end if
20:   end for
21: end while
22: return ∅

Algorithm 3 getDescGoalStates(S, S_c, G_c, 𝒟_c, τ)

Let D s be a subset of difference sets that are still violated at the 
current state Si (line 13). Given a set of difference sets T> s , the re- 
cursive procedure getDescGoalStates(S, S c , G c ,T) c ,t) (Al- 
gorithm |3| finds all minimal goal states descending from S that 
resolve T> c , taking into consideration the maximum number of al- 
lowed cell changes r. Therefore, gc(S) can be assigned to the 
cheapest state returned by the procedure getDescGoalStates. 
Note that we use a subset of difference sets that are still violated 
(T> a ), instead of using all violated difference sets, in order to effi- 
ciently compute gc(S). The computed value of gc(S) is clearly a 
lower bound on the cost of actual cheapest goal state descending 
from the current state S. To provide tight lower bounds, T> s is se- 
lected such that difference sets corresponding to large numbers of 
edges are favored. Additionally, we heuristically ensure that the 
difference sets in D s have a small overlap. 

We now describe Algorithm [3] It recursively selects a differ- 
ence set d from the set of non-resolved difference sets T> c . For 
each difference set d, we consider two alternatives: (1) excluding 
d from being resolved, if threshold r permits, and (2) resolving d 
by extending the current state S c . In the latter case, we consider all 
possible children of S c to resolve d. Once S c is extended to S' c , we 
remove from T> c all the sets that are now resolved, resulting in T)' c . 
Due to the monotonicity of the cost function, we can prune all the 
non-minimal states from the found set of states. That is, if state Si 
extends another state 5*2 and both are goal states, we remove Si . 

In the following lemma, we prove that the computed value of 
gc(S) is a lower bound on the cost of the cheapest goal descending 



Require: S : the state for which we compute gc(.) 

Require: S c '■ the current state to be extended (equals S at the first entry) 
Require: G c '■ the current conflict graph for non-resolved difference sets 

(is empty at the first entry) 
Require: T> c : the remaining difference sets to be resolved 

1 : if T> c is empty then 

2: return {S c } 

3: end if 

4: States -f— <f> 

5: select a difference set d from T> c 

6: let G' c be the graph whose edges are the union of edges corresponding 

to d and edges of G c 
7: compute a 2-approximate minimum vertex cover of G' c , denoted Ciopt 

8: if \C 2op t\ •min{|P| - 1,|E|} < rthen 
9: V' C ^V C \ {d} 

10: States States U getDescGoalStates(5, S c , G' c , T>' c ,r) 
11: end if 

12: for each possible state S' c that extends S c , is descendant of S, and 

resolves violations corresponding to d do 
13: let T>' c be all difference sets in T> c that are still violating E c that is 
corresponding to S' c 

14: States ^- States U getDescGoalStates(5, S' c , G c , T>' c , r) 
1 5 : end for 

16: remove any non-minimal states from States 
17: return States 



from state S. The proof is in Appendix |A.3| 

LEMMA 1. For any state S, gc(S) is less than or equal to the 
cost of the cheapest goal state descendant of S. 
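
To make the notion of difference sets concrete, the following Python sketch groups the edges of a conflict graph by difference set and lists, for each violated FD, the attributes that could be appended to its LHS. The four-tuple instance is made up so that it yields the difference sets BD, AD, and BCD discussed above; it is not the data of Figure 2, and the helper names (violates, difference_sets, fix_candidates) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical instance over attributes A, B, C, D (not the data of Figure 2).
I = [dict(A=1, B=1, C=1, D=1),
     dict(A=1, B=2, C=1, D=2),
     dict(A=2, B=2, C=1, D=1),
     dict(A=2, B=3, C=2, D=2)]
FDS = [(("A",), "B"), (("C",), "D")]  # A -> B and C -> D

def violates(t1, t2, fd):
    """True if t1 and t2 agree on the LHS of fd but differ on its RHS."""
    lhs, rhs = fd
    return all(t1[a] == t2[a] for a in lhs) and t1[rhs] != t2[rhs]

def difference_sets(tuples, fds):
    """Map each difference set to the conflict-graph edges that have it."""
    groups = defaultdict(list)
    for (i, t1), (j, t2) in combinations(enumerate(tuples), 2):
        if any(violates(t1, t2, fd) for fd in fds):
            diff = frozenset(a for a in t1 if t1[a] != t2[a])
            groups[diff].append((i, j))
    return groups

def fix_candidates(diff, fd):
    """Attributes whose addition to the LHS of fd resolves all edges with
    difference set diff, or None if fd is not violated by those edges."""
    lhs, rhs = fd
    if rhs in diff and not (set(lhs) & diff):
        return set(diff) - {rhs}
    return None

groups = difference_sets(I, FDS)
```

For this instance, each of the three difference sets covers exactly one edge. With larger instances, many edges typically share a difference set, which is what makes treating each group atomically worthwhile.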

Based on Lemma 1 and the correctness of the A* search algorithm
[13], we conclude that the FD set generated by Algorithm 2 is part
of a P-approximate τ-constrained repair.

We now discuss the complexity of Algorithms 2 and 3. Finding all
difference sets in line 1 of Algorithm 2 is performed in
O(|Σ| · n + |Σ| · |E| + |R| · |E|), where n denotes the number of
tuples in I, and E denotes the set of edges in the conflict graph of
I and Σ. Difference sets are obtained by building the conflict graph
of I and Σ, which costs O(|Σ| · n + |Σ| · |E|) (more details are in
Section 6), and then computing the difference set for all edges,
which costs O(|R| · |E|). In the worst case, Algorithm 2, which is
based on A* search, will visit a number of states that is
exponential in the depth of the cheapest goal state [13], which is
less than |Σ| · (|R| - 2). However, the number of states visited by
an A* search algorithm is the minimum across all algorithms that
traverse the same search tree and use the same heuristic for
computing gc(S). Also, we show in our experiments that the actual
number of visited states is much smaller than that of the best-first
search algorithm (Section 8).

The worst-case complexity of Algorithm 3, which finds gc(S), is
O(|E| · |R|^(|Σ| · |𝒟_c|)), where |𝒟_c| is the number of difference
sets passed to the algorithm. This is due to recursively inspecting
each difference set in 𝒟_c and, if it is not already resolved by the
current state S_c, appending one more attribute from the difference
set to the LHS of each FD. At each step, an approximate vertex cover
of the graph might need to be computed, which can be performed in
O(|E|).

6. NEAR-OPTIMAL DATA MODIFICATION

In this section, we derive a P-approximation of δ_opt(Σ', I),
denoted δ_P(Σ', I), where P = 2 · min{|R| - 1, |Σ|}. We also give
an algorithm that makes at most δ_P(Σ', I) cell changes in order to
resolve all the inconsistencies with respect to the modified set of
FDs computed in the previous section.

There are several data cleaning algorithms that obtain a data
repair for a fixed set of FDs, such as [4, 6, 10]. Most approaches
do not provide any bounds on the number of cells that are changed
during the repairing process. In [10], the proposed algorithm
provides an upper bound on the number of cell changes and it is
proved to be near-minimum. The approximation factor depends on the
set of FDs Σ, which is assumed to be fixed. Unfortunately, we need
to deal with multiple FD sets, and the approximation factor
described in [10] can grow arbitrarily while modifying the initial
FD set. That is, the approximation factors for two possible repairs
Σ', Σ'' in S(Σ) can be different. In this section, we provide a
method to compute δ_P(Σ', I) such that the approximation factor is
equal to 2 · min{|R| - 1, |Σ|}, which depends only on the number of
attributes in R and the number of FDs in Σ.

The output of our algorithm is a V-instance, which was first
introduced in [10] to concisely represent multiple data instances
(refer to Section 2 for more details). In the remainder of this
paper, we refer to a V-instance as simply an instance.

The algorithm we propose in this section is a variant of the data 
cleaning algorithm proposed in [3]. The main difference is that we
clean the data tuple-by-tuple instead of cell-by-cell. That is, we first 
identify a set of clean tuples that satisfy E' such that the cardinal- 
ity of the set is approximately maximum. We convert this problem 
to the problem of finding the minimum vertex cover, and we use a 
greedy algorithm with an approximation factor of 2. Then, we iter- 
atively modify violating tuples as follows. For each violating tuple 
t, we iterate over attributes of t in a random order, and we modify 
each attribute, if necessary, to ensure that the attributes processed 
so far are clean. 
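
The vertex-cover step and the resulting bound on cell changes can be sketched as follows. The sketch uses the standard matching-based 2-approximation (take both endpoints of every edge that is not yet covered); this is an assumption on our part, since the greedy variant actually used is the one described in the cited reference. The function names are hypothetical.

```python
def vertex_cover_2approx(edges):
    """Matching-based 2-approximate vertex cover: whenever an edge has no
    endpoint in the cover yet, add both of its endpoints."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

def cell_change_bound(cover, num_attrs, num_fds):
    """Upper bound |C2opt| * min{|R| - 1, |Sigma|} on the number of cell changes."""
    return len(cover) * min(num_attrs - 1, num_fds)

# A path of three conflict edges over four tuples (illustrative input).
edges = [(0, 1), (1, 2), (2, 3)]
cover = vertex_cover_2approx(edges)
bound = cell_change_bound(cover, num_attrs=4, num_fds=2)
```

On this input the cover contains all four tuples, whereas the optimal cover {1, 2} has size two, so the factor of 2 is attained exactly. Tuples outside the cover are pairwise consistent and form the set of clean tuples.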

Given a set of FDs Σ', the procedure Repair_Data in Algorithm 4
generates an instance I' that satisfies Σ'. Initially, the algorithm
constructs the conflict graph of I and Σ'. Then, the algorithm
obtains a 2-approximate minimum vertex cover of the obtained
conflict graph, denoted C2opt(Σ', I), using a greedy approach
described in [7] (for brevity, we refer to C2opt(Σ', I) as C2opt in
this section). The clean instance I' is initially set to I. The
algorithm repeatedly removes a tuple t from C2opt, and it changes
attributes of t to ensure that, for every tuple t' ∈ I' \ C2opt, t
and t' do not violate Σ' (lines 5-15). This is achieved by
repeatedly picking an attribute of t at random, and adding it to a
set denoted Fixed_Attrs (line 9). After inserting an attribute A, we
determine whether we can find an assignment to the attributes
outside Fixed_Attrs such that (t, t') does not violate Σ', for all
t' ∈ I' \ C2opt. We use Algorithm 5 to find a valid assignment, if
any, or to indicate that no valid assignment exists. Note that when
Fixed_Attrs contains only one attribute (line 6), it is guaranteed
that a valid assignment exists (line 7). If a valid assignment is
found, we keep t[A] unchanged. Otherwise, we change t[A] to the
value of attribute A of the valid assignment found in the previous
iteration (line 11). The algorithm proceeds until all tuples have
been removed from C2opt. We return I' upon termination.

Algorithm 5 searches for an assignment to the attributes of a tuple
t that are not in Fixed_Attrs such that every pair (t, t') satisfies
Σ' for all t' ∈ I' \ C2opt. An initial assignment t_c is created by
setting attributes that are in Fixed_Attrs to be equal to t, and
setting attributes that are not in Fixed_Attrs to new variables. The
algorithm repeatedly selects a tuple t' ∈ I' \ C2opt such that
(t_c, t') violates an FD X → A ∈ Σ'. If attribute A belongs to
Fixed_Attrs, the algorithm returns ∅, indicating that no valid
assignment is available. Otherwise, the algorithm sets t_c[A] to be
equal to t'[A], and adds A to Fixed_Attrs. When no other violations
can be found, the algorithm returns the assignment t_c.

Algorithm 4 Repair_Data(Σ', I)

 1: let G be the conflict graph of I and Σ'
 2: obtain a 2-approximate minimum vertex cover of G, denoted C2opt
 3: I' ← I
 4: while C2opt is not empty do
 5:   randomly pick a tuple t from C2opt
 6:   Fixed_Attrs ← {A}, where A is a randomly picked attribute from R
 7:   t_c ← Find_Assignment(t, Fixed_Attrs, I', Σ', C2opt)
 8:   while |Fixed_Attrs| < |R| do
 9:     randomly pick an attribute A from R \ Fixed_Attrs and insert
        it into Fixed_Attrs
10:     if Find_Assignment(t, Fixed_Attrs, I', Σ', C2opt) = ∅ then
11:       t[A] ← t_c[A]
12:     else
13:       t_c ← Find_Assignment(t, Fixed_Attrs, I', Σ', C2opt)
14:     end if
15:   end while
16:   remove t from C2opt
17: end while
18: return I'

Algorithm 5 Find_Assignment(t, Fixed_Attrs, I', Σ', C2opt)

 1: construct a tuple t_c such that t_c[A] = t[A] if A ∈ Fixed_Attrs,
    and t_c[A] = v_i^A if A ∉ Fixed_Attrs, where v_i^A is a new
    variable
 2: while ∃ t' ∈ I' \ C2opt such that for some FD X → A ∈ Σ',
    t_c[X] = t'[X] ∧ t_c[A] ≠ t'[A] do
 3:   if A ∈ Fixed_Attrs then
 4:     return ∅
 5:   else
 6:     t_c[A] ← t'[A]
 7:     add A to Fixed_Attrs
 8:   end if
 9: end while
10: return t_c

Figure 6: An example of repairing data: (a) initial value of I',
Σ' and C2opt (with Σ' = {CA → B, C → D} and C2opt = {t2}); (b) steps
of fixing the tuple t2

In Figure 6 we show an example of generating a data repair for
Σ' = {CA → B, C → D}, given the instance I shown in Figure 6(a).
After adding the first attribute B to Fixed_Attrs, the current valid
assignment, denoted t_c, is equal to (v_1^A, 2, v_1^C, v_1^D). When
inserting C into Fixed_Attrs, there is no need to change the value
of C because we can find a valid assignment to the remaining
attributes, which is (v_1^A, 2, 1, 1). After inserting A into
Fixed_Attrs, no valid assignment is found, and thus we set t[A] to
the value of attribute A of the previous valid assignment t_c.
Similarly, we set t[D] to t_c[D] after inserting D into Fixed_Attrs.
The resulting instance satisfies Σ'. The following lemma proves the
soundness and completeness of Algorithm 5. The proof is in
Appendix A.4.

LEMMA 2. Algorithm 5 is both sound (i.e., the obtained assignments
are valid) and complete (it will return an assignment if a valid
assignment exists).

The following theorem proves the P-optimality of Algorithm 4.
The proof is in Appendix A.5.

THEOREM 3. For a given instance I and a set of FDs Σ' ∈ S(Σ),
Algorithm Repair_Data(Σ', I) obtains an instance I' ⊨ Σ' such that
the number of changed cells in I is at most
|C2opt(Σ', I)| · min{|R| - 1, |Σ|}, and it is
2 · min{|R| - 1, |Σ|}-approximate minimum.
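
The tuple-by-tuple repair can be sketched in Python as follows. This is a simplified illustration rather than a reference implementation: tuples are dictionaries, variables are modeled by a hypothetical Var class whose instances are equal only to themselves, the vertex cover is passed in as a set of tuple indices, and the two-tuple instance is made up.

```python
import random

class Var:
    """Fresh variable of a V-instance; compares equal only to itself."""
    def __init__(self, attr):
        self.attr = attr
    def __repr__(self):
        return f"v_{self.attr}"

def find_assignment(t, fixed_attrs, clean, fds):
    """Sketch of Find_Assignment: returns a valid assignment or None."""
    fixed = set(fixed_attrs)  # local copy; the caller's set is not modified
    tc = {a: (t[a] if a in fixed else Var(a)) for a in t}
    progress = True
    while progress:
        progress = False
        for tp in clean:
            for lhs, rhs in fds:
                if all(tc[a] == tp[a] for a in lhs) and tc[rhs] != tp[rhs]:
                    if rhs in fixed:
                        return None  # a fixed value contradicts a clean tuple
                    tc[rhs] = tp[rhs]
                    fixed.add(rhs)
                    progress = True
    return tc

def repair_data(instance, fds, cover, rng=random):
    """Sketch of Repair_Data: fix the tuples whose indices are in cover."""
    repaired = [dict(t) for t in instance]
    cover = set(cover)
    while cover:
        i = rng.choice(sorted(cover))
        clean = [repaired[j] for j in range(len(repaired)) if j not in cover]
        t = repaired[i]
        order = list(t)
        rng.shuffle(order)  # attributes are processed in a random order
        fixed = {order[0]}
        tc = find_assignment(t, fixed, clean, fds)
        for a in order[1:]:
            fixed.add(a)
            candidate = find_assignment(t, fixed, clean, fds)
            if candidate is None:
                t[a] = tc[a]  # fall back to the previous valid assignment
            else:
                tc = candidate
        cover.remove(i)
    return repaired

def satisfies(tuples, fds):
    """True if no pair of tuples violates any FD."""
    return not any(
        all(s[a] == t[a] for a in lhs) and s[rhs] != t[rhs]
        for s in tuples for t in tuples for lhs, rhs in fds)

# Made-up instance with FDs CA -> B and C -> D; the second tuple is in the cover.
fds = [(("C", "A"), "B"), (("C",), "D")]
inst = [dict(A=1, B=1, C=1, D=1), dict(A=1, B=2, C=1, D=2)]
out = repair_data(inst, fds, {1}, random.Random(0))
```

Which cells are changed depends on the random attribute order, but for this instance every order changes at most |C2opt| · min{|R| - 1, |Σ'|} = 1 · min{3, 2} = 2 cells, in line with Theorem 3.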

We now describe the worst-case complexity of Algorithms 4 and 5.
Algorithm 5 has a complexity of O(|R| + |Σ'|) because constructing
t_c in line 1 costs O(|R|), and the loop in lines 2-9 iterates at
most |Σ'| times. The reason is that, for each FD X → A ∈ Σ', there
is at most one tuple in I' \ C2opt satisfying the condition in
line 2 (otherwise, tuples in I' \ C2opt would be violating X → A).

Constructing the conflict graph in line 1 of Algorithm 4 takes
O(|Σ'| · n + |Σ'| · |E|), where |Σ'| is the number of FDs in Σ', n
is the number of tuples in I, and E is the set of edges in the
resulting conflict graph. This step is performed by partitioning
tuples in I based on the LHS attributes of each FD in Σ' using a
hashing function, and constructing sub-partitions within each
partition based on the right-hand-side attributes of each FD. Edges
of the conflict graph are generated by emitting pairs of tuples that
belong to the same partition and different sub-partitions. The
approximate vertex cover is computed in O(|E|) [7]. The loop in
lines 4-17 iterates a number of times equal to the size of the
vertex cover, which is O(n). Each iteration costs
O(|R| · (|R| + |Σ'|)). To sum up, the complexity of finding a clean
instance I' is O(|Σ'| · |E| + |R|^2 · n + |R| · |Σ'| · n). Assuming
that |R| and |Σ'| are much smaller than n, the complexity is reduced
to O(|E| + n).
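
The hash-based construction of the conflict graph can be sketched as follows. The function name conflict_edges and the dictionary-based tuple representation are assumptions made for illustration, and the instance is made up.

```python
from collections import defaultdict
from itertools import combinations

def conflict_edges(tuples, fds):
    """Edges (i, j), i < j, between tuples that violate at least one FD.
    Each FD is a pair (lhs, rhs), with lhs a tuple of attribute names."""
    edges = set()
    for lhs, rhs in fds:
        # Partition by LHS value, then sub-partition by RHS value.
        partitions = defaultdict(lambda: defaultdict(list))
        for idx, t in enumerate(tuples):
            key = tuple(t[a] for a in lhs)
            partitions[key][t[rhs]].append(idx)
        # Tuples in the same partition but different sub-partitions conflict.
        for subparts in partitions.values():
            for g1, g2 in combinations(list(subparts.values()), 2):
                edges.update((min(i, j), max(i, j)) for i in g1 for j in g2)
    return edges

# Made-up instance with FDs A -> B and C -> D.
I = [dict(A=1, B=1, C=1, D=1),
     dict(A=1, B=2, C=1, D=2),
     dict(A=2, B=2, C=1, D=1),
     dict(A=2, B=3, C=2, D=2)]
E = conflict_edges(I, [(("A",), "B"), (("C",), "D")])
```

Partitioning touches each tuple once per FD, and each edge is emitted at most once per FD, which matches the O(|Σ'| · n + |Σ'| · |E|) cost stated above.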

7. COMPUTING MULTIPLE REPAIRS 

So far, we discussed how to modify the data and the FDs for a given
value of τ. One way to obtain a small sample of possible repairs is
to execute Algorithm 1 multiple times with randomly chosen values
of τ. This can be easily parallelized, but may be inefficient for
two reasons. First, multiple values of τ could result in the same
repair, making some executions of the algorithm redundant. Second,
different invocations of Algorithm 2 are expected to visit the same
states, so we should be able to re-use previous computations. To
overcome these drawbacks, we develop an algorithm (Algorithm 6) that
generates minimal FD modifications corresponding to a range of τ
values. We then use Algorithm 4 to find the corresponding data
modifications.

Algorithm 6 generates FD modifications corresponding to the
relative trust range τ ∈ [τ_l, τ_u]. Initially, τ = τ_u. We proceed
by visiting states in order of gc(.), and expanding PQ by inserting
new states. Once a goal state is found, the corresponding FD
modification Σ_h is added to the set of possible repairs. The set
Σ_h corresponds to the trust range [δ_P(Σ_h, I), τ]. Therefore, we
set the new value of τ to δ_P(Σ_h, I) - 1 in order to discover a
new repair. Because gc(.) depends on τ, we recompute gc(.) for all
states in PQ. Note that states that have been previously removed
from PQ because they were not goal states (line 13) cannot be goal
states with respect to the new value of τ. The reason is that if a
state is not a goal state for τ = x, it cannot be a goal state for
τ < x (refer to line 8). The algorithm terminates when PQ is empty,
or when τ < τ_l. Finally, we take all the FD modifications we found
(or a sample of them if we have found too many), and we generate the
corresponding data modifications.



Algorithm 6 Find_Repairs_FDs(Σ, I, τ_l, τ_u)

 1: PQ ← {(∅, ..., ∅)}
 2: τ ← τ_u
 3: FD_Repairs ← ∅
 4: while PQ is not empty and τ ≥ τ_l do
 5:   pick the state S_h with the smallest value of gc(.) from PQ
 6:   let Σ_h be the FD set corresponding to S_h
 7:   compute C2opt(Σ_h, I)
 8:   if |C2opt(Σ_h, I)| · min{|R| - 1, |Σ|} ≤ τ then
 9:     add Σ_h to FD_Repairs
10:     τ ← |C2opt(Σ_h, I)| · min{|R| - 1, |Σ|} - 1
11:     for each state S_i ∈ PQ, recompute gc(S_i) using the new value
        of τ
12:   end if
13:   remove S_h from PQ
14:   for each state S_i that is a child of S_h do
15:     compute gc(S_i) (similar to Algorithm 2)
16:     insert S_i into PQ
17:   end for
18: end while
19: return FD_Repairs
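
The way the threshold is lowered after each discovered repair can be illustrated with the following simplified sketch. It deliberately replaces the A* search by a brute-force choice among a precomputed list of candidate FD modifications, each given as a hypothetical triple (name, FD cost, δ_P); only the threshold-update logic mirrors Algorithm 6, and all values are made up.

```python
def sweep_trust_range(candidates, tau_l, tau_u):
    """Enumerate repairs for tau in [tau_l, tau_u], from tau_u downwards.
    candidates: list of (name, fd_cost, delta_p) triples (illustrative)."""
    repairs = []
    tau = tau_u
    while tau >= tau_l:
        feasible = [c for c in candidates if c[2] <= tau]
        if not feasible:
            break  # no FD modification meets the current threshold
        name, fd_cost, delta_p = min(feasible, key=lambda c: c[1])
        repairs.append((name, (delta_p, tau)))  # valid for [delta_p, tau]
        tau = delta_p - 1  # force the next repair to change fewer cells
    return repairs

# Made-up candidates: cheaper FD changes require more cell changes.
candidates = [("keep FDs", 0, 8), ("extend one FD", 2, 4), ("extend both FDs", 5, 0)]
ranges = sweep_trust_range(candidates, tau_l=0, tau_u=10)
```

Each discovered repair is reported together with the range of τ values for which it is the selected solution, so no value of τ in the range is searched twice.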



8. EXPERIMENTAL EVALUATION 

In this section, we study the relationship between the quality of
repairs and the relative trust determined by τ, and we compare our
approach to the technique introduced in [5]. Also, we show the
efficiency of our repair generating algorithms.

8.1 Setup 

All experiments were conducted on a SunFire X4100 server with a
Quad-Core 2.2GHz processor and 8GB of RAM. All computations are
executed in memory. Repairing algorithms are executed as
single-threaded processes, and we limit memory usage to 1.5GB. We
use a real data set, namely the Census-Income data set¹, which is
part of the UC Irvine Machine Learning Repository. Census-Income
consists of 300k tuples and 40 attributes (we only use 34 attributes
in our experiments). To perform experiments on smaller data sizes,
we randomly pick a sample of tuples.

We tested two variants of Algorithm Repair_Data_FDs: A*-Repair,
which uses the A*-based search algorithm described in Section 5.2,
and Best-First-Repair, which uses a best-first search to obtain FD
repairs, as we described in Section 5. Both variants use Algorithm 4
to obtain the corresponding data repair. We use the number of
distinct values to measure the weights of sets of attributes
appended to LHS's of FDs (i.e., w(Y) = F_count(Y)(Π_Y(I)), the
number of distinct values of Y in I). In our experiments, we adjust
the relative threshold τ_r, rather than the absolute threshold τ. We
also implemented the repairing algorithm introduced in [5], which
uses a unified cost model to quantify the goodness of each data-FD
repair and obtains the repair with the (heuristically) minimum cost.
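
The weight function can be computed directly as a distinct count over the projection on Y. The helper name weight and the instance below are assumptions made for illustration.

```python
def weight(tuples, attrs):
    """w(Y): number of distinct value combinations of the attributes Y."""
    return len({tuple(t[a] for a in attrs) for t in tuples})

# Made-up instance over attributes A, B, C, D.
I = [dict(A=1, B=1, C=1, D=1),
     dict(A=1, B=2, C=1, D=2),
     dict(A=2, B=2, C=1, D=1),
     dict(A=2, B=3, C=2, D=2)]
w_a = weight(I, ("A",))
w_ab = weight(I, ("A", "B"))
```

Under this measure, appending attributes with many distinct values is more expensive, and w(Y) can never decrease when Y grows.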

In order to assess the quality of the generated repairs, we first
use an FD discovery algorithm to find all the minimal FDs with a
relatively small number of attributes in the LHS (less than 6). In
each experiment, we randomly select a number of FDs from the
discovered list of FDs. We denote by I_c and Σ_c the clean database
instance and the FDs, respectively. The data instance I_c is
perturbed by changing the value of some cells such that each cell
change results in a violation of an FD. Specifically, we inject two
types of violations as follows.

• Right-hand-side violation: We first search for two tuples
t_i, t_j that agree on XA for some FD X → A ∈ Σ. Then,
we modify t_i[A] to be different from t_j[A].

• Left-hand-side violation: We search for two tuples t_i, t_j such
that for some FD X → A, t_i[X \ {B}] = t_j[X \ {B}],
t_i[B] ≠ t_j[B] and t_i[A] ≠ t_j[A], where B ∈ X. We introduce
a violation by setting t_i[B] to t_j[B].

¹ http://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD)

We refer to the resulting instance as I_d. In our approach, we
concentrate on one method of fixing FDs, which is appending one or
more attributes to LHS's of FDs. Therefore, we perturb the FDs by
randomly removing a number of attributes from their LHS's. The
perturbed set of FDs is denoted Σ_d. The cleaning algorithm is
applied to (Σ_d, I_d), and the resulting repair is denoted
(Σ_r, I_r). The parameters that control the perturbation of data and
FDs are (1) Data Error Rate, which is the fraction of cells that are
modified, and (2) FD Error Rate, which is the fraction of LHS
attributes that were removed. We use the following metrics to
measure the quality of the modified data and FDs.

• Data precision: the ratio of the number of correctly modified
cells to the total number of cells modified by the repair
algorithm. A modification of a cell t[A] is considered correct if
the values of t[A] in I_c and I_d are different, and either t[A]
in I_r is equal to t[A] in I_c, or t[A] is a variable in I_r.

• Data recall: the ratio of the number of correctly modified
cells to the total number of erroneous cells (i.e., cells with
different values in I_d and I_c).

• FD precision: the ratio of the number of attributes correctly
appended to LHS's of FDs in Σ_d to the total number of appended
attributes.

• FD recall: the ratio of the number of attributes correctly
appended to LHS's of FDs in Σ_d to the total number of attributes
removed from Σ_c while constructing Σ_d.

In order to measure the overall quality of a repair (Σ_r, I_r), we
compute the harmonic averages of precision and recall for both data
and FDs (also called F-scores). Then, we compute the average
F-score for data and FDs, which we refer to as the combined F-score.
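
The combined F-score can be written out as follows. The function names are hypothetical, the convention of returning 0 when both precision and recall are 0 is an assumption, and the input values are purely illustrative.

```python
def f_score(precision, recall):
    """Harmonic average of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def combined_f_score(data_p, data_r, fd_p, fd_r):
    """Average of the data F-score and the FD F-score."""
    return (f_score(data_p, data_r) + f_score(fd_p, fd_r)) / 2

# Illustrative values: perfect data quality, FD precision 0.5, FD recall 0.4.
score = combined_f_score(1.0, 1.0, 0.5, 0.4)
```

With these inputs, the FD F-score is 2 · 0.5 · 0.4 / 0.9 ≈ 0.444, and the combined F-score is (1 + 0.444) / 2 ≈ 0.722.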

8.2 Impact of Relative Trust on Repair Quality

In this experiment, we measure the combined F-score at various
error rates. We use 5000 tuples from the Census-Income data set to
represent the clean instance I_c, and we use an FD with 6 LHS
attributes to represent Σ_c. Figure 7 shows the combined F-score
for various data sets, for multiple values of τ_r. When only FD
perturbation is performed, we notice that the peak quality occurs
at τ_r = 0% (i.e., when no changes to data are allowed). At FD error
rate of 50%, we notice that the peak quality occurs at τ_r = 17%. At
30% FD error rate and 5% data error rate, the peak quality occurs at
a higher value (τ_r = 28.9%). Finally, when only data perturbation
is performed, the peak quality occurs at τ_r = 100% (i.e., the
algorithm can freely change the data, while obtaining the cheapest
FD repair, which is the original FD).

In Figure 8 we compare the quality of the repairs generated by our
algorithm, denoted Relative-Trust Repairing, to the quality of
repairs generated by the repairing algorithm in [5], denoted
Uniform-Cost Repairing. For both algorithms, we tested multiple
parameter settings and we reported the quality metrics of the repair
with the highest combined F-score. For example, for FD error of 50%
and data error of 5% our algorithm achieved the maximum combined
F-score of 0.26 at τ_r = 17%. For all data sets, we noticed that the
algorithm in [5] did not choose to modify the FD under any parameter
setting, resulting in FD precision of 1 and recall of 0 for the
first three data sets, and recall of 1 for the fourth data set.
Because our algorithm is aware of the different levels of relative
trust, we were able to achieve higher quality scores when choosing
the appropriate value of τ_r. This is clear in the first data set
with FD error of 80% and data error of 0%. Insisting on modifying
the data, not the FD, resulted in FD recall of 0, and data precision
of 0. On the other hand, when setting τ_r to 0%, our algorithm kept
the data unmodified, resulting in perfect data precision and recall,
and changed the FD, resulting in FD precision of 0.5 and FD recall
of 0.4.

Figure 7: Repair quality at multiple error rates

Figure 8: The maximum quality achievable by our algorithm and the
algorithm in [5]

Note that, in general, the precision and recall for data repairs
are relatively low due to the high uncertainty about the right cells
to modify. For example, given an FD A → B, and two violating tuples
t1 and t2, we have four cells that can be changed in order to repair
the violation: t1[A], t1[B], t2[A], and t2[B]. This uncertainty can
be reduced by considering additional information such as the user
trust in various attributes and tuples (e.g., [4, 5, 10]). Using
this kind of information is not considered in our work.

8.3 Performance Results 

8.3.1 Scalability with the Number of Tuples 

In this experiment, we show the scalability of our algorithms with
respect to the number of tuples. We use two FDs, and we set τ_r to
1%. Figures 9(a) and 9(b) show the running time, and the number of
visited states, respectively, against the number of tuples. When
increasing the number of tuples in the range [0, 20000], the number
of unique difference sets increases, while the average frequency of
difference sets remains relatively small, compared to τ. It follows
that the computed lower bounds gc(S) are very loose because most
difference sets considered by Algorithm 3 can be left unresolved
(i.e., the condition in line 8 is true). Thus, the search algorithm
needs to visit more states, as we show in Figure 9(b).

When the number of tuples increases beyond 20000, we notice in
Figure 9 that the running time, as well as the number of visited
states, decreases. The reason is that, in the state searching
algorithm, the number of distinct difference sets stabilizes after
reaching a certain number of tuples, and the frequencies of
individual difference sets start increasing. It follows that most
difference sets can no longer remain unresolved, and tighter lower
bounds gc(S) are reported, which leads to decreasing the number of
visited states (Figure 9(b)).

Figure 9: Performance at various instance sizes

Algorithm Best-First-Repair does not depend on cost es- 
timation, and thus, the execution time rapidly grows with the num- 
ber of tuples in the entire range [0, 60000]. 

8.3.2 Scalability with the Number of Attributes 
Figure 10 depicts the scalability of our approach with respect to
the number of attributes. In this experiment, we used two FDs and
24000 tuples, and we set τ_r to 1%. We changed the number of
attributes by excluding some number of attributes from the input
relation. The running time increases with the number of attributes
mainly because the size of the state space increases exponentially
with the number of attributes.

8.3.3 Scalability with the Number of FDs 

Figure 11 depicts the scalability of our approach with respect to
the number of FDs. In this experiment, we used 10000 tuples, and we
set τ_r to 1%. We use a single FD, and we replicate this FD multiple
times to simulate larger sizes of Σ. The size of the state space
grows exponentially with the number of FDs. Thus, the searching
algorithm visits more states, which increases the overall running
time for both approaches: A*-Repair and Best-First-Repair. Note that
the algorithm Best-First-Repair did not terminate in 24 hours when
the number of FDs is greater than 2.

8.3.4 Effect of the Relative Trust Parameter τ

Figures 12(a) and 12(b) show the running time and the number of
visited states, respectively, for various values of τ_r. In this
experiment, we fix the number of tuples to be 5000, and we use Σ_d
with one FD. The number of appended attributes ranges from 9 at
τ_r = 10% to 1 at τ_r = 99%. No repair could be found for τ_r less
than 10%. We notice that at small values of τ_r, Algorithm A*-Repair
is orders of magnitude faster than Algorithm Best-First-Repair. This
is due to the effectiveness (i.e., tightness) of the cost estimation
implemented in Algorithm A*-Repair. The lack of such estimation
causes Algorithm Best-First-Repair to visit many more states.

Figure 13: Performance under uncertain relative trust

As the value of τ_r increases up to 55%, we observe that Algorithm
A*-Repair becomes slower. The reason is that larger values of τ_r
decrease the tightness of the computed bounds gc(S). As τ_r
increases beyond 55%, we notice an improvement in the running time
as we only need to add a very small number of attributes to reach a
goal state.

8.3.5 Generating Multiple Repairs 

In this experiment, we assess the efficiency of two approaches that
generate possible repairs for a given range of τ_r. In the first
approach, denoted Range-Repair, we execute Algorithm 6, and we
invoke the data repair algorithm (Algorithm 4) for each obtained FD
repair. In the second approach, denoted Sampling-Repair, we invoke
the algorithm A*-Repair at a sample of possible values of τ_r. In
this experiment, we used 5000 tuples, and one FD. We set the minimum
value of τ_r to 0, and we varied the upper bound of τ_r in the range
[10%, 30%], which is represented by the X-axis in Figure 13. For the
sampling approach, we started at τ_r = 0%, and we increased τ_r in
steps of 1.7% (which is equal to 13 in this experiment) until we
reached the maximum value of τ_r. Figure 13 shows the running time
for both approaches. We observe that Range-Repair outperforms the
sampling approach, especially at wide ranges of τ_r. For example,
for the range [0, 30%], Range-Repair is 3.8 times faster than
Sampling-Repair.

9. RELATED WORK 

The closest work to ours is [5], which proposed a technique to
obtain a single repair, (Σ', I'), of the FDs and the data,
respectively, for a given input (Σ, I). A unified cost model was
proposed to measure the distance between a repair (Σ', I') and the
inputs (Σ, I). An approximate algorithm was presented that obtains a
repair with the minimum cost. There are many differences between our
work and [5], including: 1) we incorporate the notion of relative
trust in the data cleaning process and produce multiple suggested
repairs corresponding to various levels of relative trust; 2) [5]
does not give any minimality guarantees for the generated repairs;
3) the algorithm proposed in [5] searches a constrained repair space
that only considers adding single attributes to LHS's of FDs in Σ,
while we explore a larger repair space that considers appending any
subset of R to the LHS of each FD.

The idea of modifying a supplied set of FDs to better fit the data 
was also discussed in [8]. The goal of that work was to generate a 
small set of Conditional Functional Dependencies (CFDs) by mod- 
ifying the embedded FD. Modifying the data and relative trust were 
not discussed. 

The problem of cleaning the data in order to satisfy a fixed set of
FDs has been studied in, e.g., [3, 4, 6, 10]. In our context, these
solutions may be classified as having a fixed threshold τ_r of 100%.
Part of our work is inspired by the algorithm proposed in [3] in the
sense that we incrementally modify the data until there are no
inconsistencies left. However, we modify individual tuples instead
of attribute values. Also, unlike the approach in [3], we provide an
upper bound on the number of changes.

Another related problem is discovering which FDs hold
(approximately or exactly) on a fixed database instance; see, e.g.,
[9, 11, 12, 14]. There are two important differences between these
approaches and our work: 1) instead of discovering the FDs from
scratch, we start with a set of provided FDs which have a certain
level of trust, and we aim for a minimal modification of the
provided FDs that yields at most τ violations; 2) in previous work,
there are only "local" guarantees on the goodness (i.e., the number
of violating tuples) of each FD, whereas in this paper we must make
"global" guarantees that the whole set of FDs cannot be violated by
more than τ tuples. Thus, existing techniques for FD discovery are
not applicable to our problem.

10. CONCLUSIONS 

In this paper, we studied a data quality problem in which we are 
given a data set that does not satisfy the specified integrity con- 
straints (namely, Functional Dependencies), and we are uncertain 
whether the data or the FDs are incorrect. We proposed a solu- 
tion that computes various suggestions for how to modify the data 
and/or the FDs (in a nearly-minimal way) in order to achieve con- 
sistency. These suggestions cover various points on the spectrum 
of relative trust between the data and the FDs, and can help users 
determine the best way to resolve inconsistencies in their data. We 
believe that our relative trust framework is relevant and applicable 
to many other types of constraints, such as conditional FDs (CFDs), 
inclusion dependencies and logical predicates. In future work, we 
plan to develop cleaning algorithms within our framework for these 
constraints. 
APPENDIX
A. PROOFS 

A.1 Proof of Theorem 1

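In the following, we prove the first part of the theorem. Let
$(\Sigma', I')$ be a $\tau$-constrained repair. It follows that no repair
$(\Sigma'', I'')$ has
$(dist_c(\Sigma, \Sigma''), dist_d(I, I'')) \prec (dist_c(\Sigma, \Sigma'), \tau)$.
Because $dist_d(I, I') \le \tau$, any pair that dominates
$(dist_c(\Sigma, \Sigma'), dist_d(I, I'))$ also dominates
$(dist_c(\Sigma, \Sigma'), \tau)$. Hence, there is no repair
$(\Sigma'', I'')$ that satisfies
$(dist_c(\Sigma, \Sigma''), dist_d(I, I'')) \prec (dist_c(\Sigma, \Sigma'), dist_d(I, I'))$.
In other words, $(\Sigma', I')$ is a minimal repair.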

We prove the second part by contradiction. Assume that $(\Sigma', I')$
is a minimal repair but it is not a $\tau$-constrained repair for the
values of $\tau$ described in Equation 1. Because $\tau \ge dist_d(I, I')$,
and based on the definition of a $\tau$-constrained repair, there must
exist a repair $(\Sigma_x, I_x)$ such that
$(dist_c(\Sigma, \Sigma_x), dist_d(I, I_x)) \prec (dist_c(\Sigma, \Sigma'), \tau)$
(if multiple repairs satisfy this criterion, we select the repair with
the minimum distance to $I$, and we break ties using the smaller
distance to $\Sigma$). Repair $(\Sigma_x, I_x)$ is a minimal repair because
no other repair can dominate $(\Sigma_x, I_x)$ with respect to distances
to $I$ and $\Sigma$. Because $(\Sigma', I')$ is a minimal repair, we have
$dist_d(I, I_x) > dist_d(I, I')$ (otherwise, $(\Sigma', I')$ would be
dominated by $(\Sigma_x, I_x)$). However, the existence of
$(\Sigma_x, I_x)$ contradicts the fact that no minimal repair exists
with distance to $I$ in the range $(dist_d(I, I'), \tau]$ (based on the
value of $\tau$ obtained by Equation 1).
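
The dominance relation used in this proof can be illustrated with a small sketch. It assumes that every repair is summarized by a hypothetical pair (dist_c, dist_d), and the function names are ours, not the paper's. The sketch computes the minimal (non-dominated) repairs and, following our reading of Equation 1, the half-open range of $\tau$ values for which each minimal repair would be the $\tau$-constrained repair.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: (dist_c to the original FDs, dist_d to the original data).
Repair = Tuple[float, int]


def dominates(a: Repair, b: Repair) -> bool:
    """a precedes b: no worse in both distances and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def minimal_repairs(repairs: List[Repair]) -> List[Repair]:
    """Repairs that no other repair dominates, sorted by dist_d."""
    front = {r for r in repairs if not any(dominates(o, r) for o in repairs)}
    return sorted(front, key=lambda r: r[1])


def tau_ranges(repairs: List[Repair]) -> Dict[Repair, Tuple[int, Optional[int]]]:
    """Map each minimal repair to [lo, hi): lo is its own dist_d and hi is the
    dist_d of the next minimal repair (None if there is no such repair)."""
    front = minimal_repairs(repairs)
    ranges = {}
    for i, r in enumerate(front):
        hi = front[i + 1][1] if i + 1 < len(front) else None
        ranges[r] = (r[1], hi)
    return ranges
```

For instance, with candidate pairs (5, 0), (3, 2) and (1, 6) on the front, the repair (3, 2) covers the values of $\tau$ from 2 up to, but not including, 6.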

A.2 Proof of Theorem 2

For a generated repair $(\Sigma', I')$, the condition
$dist_d(I, I') \le \tau$ holds due to the constraint
$\delta_{opt}(\Sigma', I) \le \tau$ in line 1. For any
$\Sigma'' \in S(\Sigma)$, $\delta_{opt}(\Sigma'', I) \le dist_d(I, I'')$
for all $I'' \models \Sigma''$, and thus the condition
$dist_d(I, I'') \le \tau$ implies that $\delta_{opt}(\Sigma'', I) \le \tau$.
Therefore, the condition
$\nexists \Sigma'' \in S(\Sigma)\ (\delta_{opt}(\Sigma'', I) \le \tau \wedge dist_c(\Sigma, \Sigma'') < dist_c(\Sigma, \Sigma'))$
in line 1, along with the tie-breaking mechanism, implies that
$\nexists (\Sigma'', I'') \in U$ with
$(dist_c(\Sigma, \Sigma''), dist_d(I, I'')) \prec (dist_c(\Sigma, \Sigma'), \tau)$.
Thus, $(\Sigma', I')$ is a $\tau$-constrained repair.
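
The selection condition referred to in this proof can be sketched as follows. The sketch assumes that the candidate FD sets in $S(\Sigma)$ are already enumerated together with their distances and their values of $\delta_{opt}$. The tuple layout and the function name are hypothetical, and breaking ties on the smaller $\delta_{opt}$ is our assumption rather than something stated in this section.

```python
from typing import List, Optional, Tuple

# Hypothetical candidate record: (label, dist_c to the original FDs, delta_opt).
Candidate = Tuple[str, float, int]


def pick_fd_set(candidates: List[Candidate], tau: int) -> Optional[Candidate]:
    """Among candidates whose delta_opt is at most tau, return one with the
    smallest dist_c. Ties are broken on the smaller delta_opt (an assumption)."""
    feasible = [c for c in candidates if c[2] <= tau]
    if not feasible:
        return None
    return min(feasible, key=lambda c: (c[1], c[2]))
```

Increasing `tau` can only enlarge the feasible set, so the distance of the selected FD set to the original FDs never increases as more data changes are allowed.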

A.3 Proof of Lemma 1

Let $\Sigma$ be the set of FDs corresponding to $S$. Assume that we
are using the entire set of difference sets, denoted $\mathcal{D}_{all}$,
that violate $\Sigma$ rather than using a subset of difference sets
(line 13 of the algorithm).

The cheapest goal state $S_g$ that is a descendant of $S$ will be
among the states returned by the procedure getDescGoalStates
because this procedure returns all minimal goal states (if any),
and $S_g$ is minimal (i.e., there exists no other state $S'$ such that
$S_g$ extends $S'$ and $S'$ is a goal state).

Because we are actually using only a subset of all difference sets
$\mathcal{D}_{all}$, a state may qualify as a goal state more easily,
and thus the cost of the reported cheapest goal state is less than or
equal to the actual cost of the cheapest goal state.

A.4 Proof of Lemma 2

We first prove the soundness of the algorithm. That is, we
need to prove that if a tuple $t_c$ is returned, then $t_c[A] = t[A]$
for $A \in$ Fixed_Attrs, and for all $t' \in I' \setminus C_{2opt}$,
$(t_c, t')$ does not violate $\Sigma$.

From the algorithm description, it is clear that the condition
$t_c[A] = t[A]$ holds for $A \in$ Fixed_Attrs. Also, the condition
in line 2 ensures that whenever a tuple $t_c$ is returned, there does not
exist $t' \in I' \setminus C_{2opt}$ such that $(t_c, t')$ violates any
FD in $\Sigma$.

We prove the completeness of the algorithm by contradiction.
Assume that Algorithm 5 returns $\emptyset$, while there exists a tuple
$t_g$ that satisfies the conditions $t_g[A] = t[A]$ for
$A \in$ Fixed_Attrs, and $(t_g, t')$ satisfies $\Sigma$ for all
$t' \in I' \setminus C_{2opt}$.

We first show that, just before returning $\emptyset$ at line 4, all
attributes in Fixed_Attrs have equal values in tuples $t_c$ and $t_g$.
This is clearly true for the initial value of Fixed_Attrs. Let $A$ be
the attribute that is first inserted in Fixed_Attrs in line 7. Before
setting $t_c[A]$ to $t'[A]$ in line 6, there exists a tuple
$t' \in I' \setminus C_{2opt}$ such that $(t', t_c)$ violates an FD
$X \to A \in \Sigma$. Attributes in $X$ belong to Fixed_Attrs because
attributes outside Fixed_Attrs are assigned to new variables (line 1)
and cannot be equal to attributes of any other tuples. It follows that
attribute $A$ in any valid solution must be equal to $t'[A]$ in order
to satisfy $\Sigma$. Thus, $t_c[A] = t_g[A] = t'[A]$. The same argument
is valid for attributes that are successively inserted into
Fixed_Attrs before returning $\emptyset$. When the algorithm returns
$\emptyset$ (line 4), there exists a tuple $t' \in I' \setminus C_{2opt}$
such that $(t', t_c)$ violates an FD $X \to A \in \Sigma$ and attribute
$A$ belongs to Fixed_Attrs. Because $XA \subseteq$ Fixed_Attrs, and
attributes in Fixed_Attrs have equal values in $t_c$ and $t_g$, it
follows that $(t', t_g)$ violates $X \to A$ as well (i.e., $t_g$ is not
a valid answer), which contradicts our initial assumption.
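
The two conditions that a returned tuple must meet in this lemma can be checked with a short sketch. The row encoding, the `Var` class for fresh variables, and the function names are illustrative assumptions. The sketch does not reproduce Algorithm 5 itself; it only tests whether a given candidate $t_c$ is a valid answer.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Row = Dict[str, object]
FD = Tuple[FrozenSet[str], str]  # (X, A) stands for the FD X -> A


class Var:
    """Fresh variable: equal only to itself (default identity equality)."""


def violates(t1: Row, t2: Row, fd: FD) -> bool:
    """True if t1 and t2 agree on X but differ on A."""
    lhs, rhs = fd
    return all(t1[a] == t2[a] for a in lhs) and t1[rhs] != t2[rhs]


def is_valid_assignment(t_c: Row, t: Row, fixed_attrs: Iterable[str],
                        clean_rows: List[Row], fds: List[FD]) -> bool:
    """Check that t_c keeps t's values on the fixed attributes and that it
    violates no FD together with any tuple outside the vertex cover."""
    agrees = all(t_c[a] == t[a] for a in fixed_attrs)
    consistent = not any(violates(t_c, tp, fd)
                         for tp in clean_rows for fd in fds)
    return agrees and consistent
```

Because a `Var` instance never equals a constant or another variable, an attribute set to a fresh variable cannot match the left-hand side of an FD against any other tuple, which mirrors the argument about attributes outside Fixed_Attrs.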

A.5 Proof of Theorem 3

We first prove that the returned $I'$ satisfies $\Sigma'$. Let $G$ be
the conflict graph of $I$ with respect to $\Sigma'$ and let $C_{2opt}$
be a 2-approximate minimum vertex cover of $G$ that is obtained at
line 2 in Algorithm 4. The tuple set $I \setminus C_{2opt}$ satisfies
$\Sigma'$, and thus the corresponding tuples in $I'$ satisfy $\Sigma'$
as well. For each tuple $t$ that is randomly picked from $C_{2opt}$ in
line 5 in Algorithm 4, modifying $t$ as described in lines 6-15 makes
the set $(I' \setminus C_{2opt}) \cup \{t\}$ satisfy $\Sigma'$, as we
show in the following. We observe that for a Fixed_Attrs containing a
single attribute $A$, there exists an assignment to the attributes
$R \setminus \{A\}$ in $t$ such that
$(I' \setminus C_{2opt}) \cup \{t\}$ satisfies $\Sigma'$ (i.e., $t_c$
cannot be $\emptyset$ at line 7). We describe one possible assignment as
follows. If the value of $t[A]$ does not appear in attribute $A$ of any
tuple in $I' \setminus C_{2opt}$, then setting attributes
$R \setminus \{A\}$ to new variables is a valid assignment. Otherwise,
let $t_r$ be a tuple in $I' \setminus C_{2opt}$ such that
$t[A] = t_r[A]$. Setting attributes $R \setminus \{A\}$ in $t$ to the
values of the corresponding attributes in $t_r$ is a valid assignment.
Thus, $t_c$ cannot be $\emptyset$ in line 7 due to the completeness of
Algorithm 5, which is proved in Lemma 2.

After each iteration of the while loop in line 8, Algorithm 4
maintains a tuple $t_c$ such that the current attributes in Fixed_Attrs
have equal values in $t_c$ and the current version of $t$, and the
other attributes outside Fixed_Attrs in $t_c$ are assigned values that
make $(I' \setminus C_{2opt}) \cup \{t_c\}$ satisfy $\Sigma'$ (due to
the soundness of Algorithm 5, as proved in Lemma 2). After inserting all
attributes in Fixed_Attrs, $t$ is equal to $t_c$ and thus
$(I' \setminus C_{2opt}) \cup \{t\}$ satisfies $\Sigma'$. After
processing, and removing, all tuples from $C_{2opt}$, the resulting
instance $I'$ satisfies $\Sigma'$.

We prove the approximate optimality of the algorithm as follows.
Let $C_{opt}$ be a minimum vertex cover of $G$. The minimum number
of cell changes $\delta_{opt}(\Sigma', I)$ must be greater than or equal
to $|C_{opt}|$. This can be proved by contradiction as follows. Assume
that there exists an instance $I' \models \Sigma'$ such that the number
of changed cells in $I'$ is less than $|C_{opt}|$. Let $T$ be the set of
changed tuples in $I'$. Each tuple in $T$ contains at least one changed
cell, so $|T|$ is at most the number of changed cells. $T$ represents a
vertex cover of $G$ and $|T| < |C_{opt}|$, which contradicts the
minimality of $C_{opt}$.
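
The conflict graph and the vertex cover used in this argument can be sketched as follows. The sketch assumes that rows are dictionaries and that each FD is a pair (X, A). It uses the standard maximal-matching heuristic to obtain a 2-approximate cover, which is our choice for illustration, since this section does not state how $C_{2opt}$ is computed.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple

Row = Dict[str, object]
FD = Tuple[Sequence[str], str]  # (X, A) stands for the FD X -> A


def conflict_edges(rows: List[Row], fds: List[FD]) -> List[Tuple[int, int]]:
    """Edges (i, j) between row indices that jointly violate some FD."""
    edges = []
    for i, j in combinations(range(len(rows)), 2):
        if any(all(rows[i][a] == rows[j][a] for a in lhs)
               and rows[i][rhs] != rows[j][rhs]
               for lhs, rhs in fds):
            edges.append((i, j))
    return edges


def two_approx_vertex_cover(edges: List[Tuple[int, int]]) -> Set[int]:
    """Take both endpoints of every edge that is not yet covered. The chosen
    edges form a maximal matching, so the cover is at most twice the optimum."""
    cover: Set[int] = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover
```

Removing the covered rows leaves a set of tuples with no conflict edges, which corresponds to the observation that $I \setminus C_{2opt}$ satisfies $\Sigma'$.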

In the following, we prove that the number of changed cells
is at most $|C_{2opt}| \cdot \min\{|R| - 1, |\Sigma|\}$, which is a
$2 \cdot \min\{|R| - 1, |\Sigma|\}$-approximate minimum, based on the
fact that $\delta_{opt}(\Sigma', I) \ge |C_{opt}|$. The algorithm changes
only attributes of tuples in $C_{2opt}$. Furthermore, we prove that the
number of changed cells in each tuple in $C_{2opt}$ is at most
$\min\{|R| - 1, |\Sigma|\}$. It is clear that the maximum number of
changed cells in each tuple is $|R| - 1$ because the first attribute
inserted into Fixed_Attrs cannot be changed (line 6 in Algorithm 4).

We show that after changing $|\Sigma'|$ attributes in $t$, the set
$(I' \setminus C_{2opt}) \cup \{t\}$ satisfies $\Sigma'$ and thus no
other attributes in $t$ need to be changed. In general, we prove that
after the $k$-th change to $t$, $(I' \setminus C_{2opt}) \cup \{t\}$
can violate at most $|\Sigma'| - k$ FDs in $\Sigma'$. Let $B$ be a
changed attribute in $t$. If $B$ was changed to a variable, there must
exist an FD $X \to A \in \Sigma'$ such that $B \in X$. The reason is
that if $B$ does not appear in any FD, it cannot be changed by
Algorithm 4, and if $B$ only appears as a right-hand-side attribute in
FDs in $\Sigma'$, it can only remain unchanged or be changed to a
constant. It follows that $(t, t')$ cannot violate $X \to A$, for all
$t' \in I' \setminus C_{2opt}$, after changing $t[B]$ to a variable and
adding $B$ to Fixed_Attrs. If $B$ was changed to a constant, there must
exist an FD $X \to B \in \Sigma'$ and another FD $Y \to X$ implied by
$\Sigma'$ such that $Y \subseteq$ Fixed_Attrs and the values of the
attributes in $Y$ are constants (refer to lines 2-9 in Algorithm 5). In
successive iterations, attributes in $X$ cannot be assigned values
other than the current constants in $t_c$; otherwise, $Y \to X$ would
be violated. It follows that $(t, t')$ cannot violate $X \to B$, for
$t' \in I' \setminus C_{2opt}$, in successive iterations. After changing
$|\Sigma'|$ attributes in $t$, we do not need to perform further
changes. Because $|\Sigma'| \le |\Sigma|$, it follows that the maximum
number of attributes changed for each tuple in $C_{2opt}$ is
$|\Sigma|$, which completes the proof.
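
The pieces of the approximation argument can be collected into one chain of inequalities. This is a restatement under the proof's own notation, where the shorthand $m$ is introduced here only for brevity and the second step uses the fact that $C_{2opt}$ is a 2-approximate minimum vertex cover.

```latex
% m is shorthand introduced for this summary only.
\[
  m = \min\{|R| - 1,\ |\Sigma|\}
\]
\[
  \#\text{changed cells}
  \;\le\; |C_{2opt}| \cdot m
  \;\le\; 2\,|C_{opt}| \cdot m
  \;\le\; 2m \cdot \delta_{opt}(\Sigma', I)
\]
```

The first step is the per-tuple bound, the second is the vertex cover guarantee, and the third is the lower bound $\delta_{opt}(\Sigma', I) \ge |C_{opt}|$.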